
European Radiology

Springer Science and Business Media LLC

Preprints posted in the last 90 days, ranked by how well they match European Radiology's content profile, based on 14 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.

1
SCOPE: AI-Assisted Early Detection of Potentially Curable Pancreatic Neoplasms on CT from Local and Global Information

Oviedo, F.; Lopez Ramirez, F.; Blanco, A.; Facciola, J.; Kwak, S.; Zhao, J. M.; Syailendra, E. A.; Tixier, F.; Dodhia, R.; Hruban, R. H.; Weeks, W. B.; Lavista Ferres, J. M.; Chu, L. C.; Fishman, E. K.

2026-02-05 radiology and imaging 10.64898/2026.02.04.26345495 medRxiv
Top 0.1%
22.8%

Purpose: To develop SCOPE (Small-lesion COntextual Pancreatic Evaluator), a deep learning model designed to improve CT detection of small pancreatic lesions--pancreatic ductal adenocarcinoma (PDAC), pancreatic neuroendocrine tumors (PanNETs), and cystic lesions--by integrating voxel-level features with global context. Materials and Methods: This retrospective study used three independent datasets. A development cohort of 4,065 contrast-enhanced CT scans was used to train a deep neural network that performs pancreas, ductal, and lesion segmentation with an integrated classification head. A metamodel combined segmentation-derived and global contextual signals for case-level prediction. Performance was assessed on (1) an internal holdout test set (n = 605), (2) an external multi-institutional PDAC dataset from the PANORAMA challenge (n = 2,238), and (3) an expert-curated small-lesion reader study (n = 200). Areas under the receiver operating characteristic curve (AUCs) were compared using the DeLong test; sensitivities and specificities using McNemar's test. Results: On the internal test set, SCOPE improved lesion-versus-normal AUC compared with the best segmentation baseline (0.974 [95% CI: 0.964, 0.984] vs 0.956; P = .006) and increased small-lesion sensitivity at 95% specificity (0.727 [95% CI: 0.653, 0.801] vs 0.600; P = .012). Performance gains were observed across lesion classes, with significant improvements for PDAC and PanNET detection. On the external dataset, SCOPE improved PDAC-versus-non-PDAC AUC (0.978 vs 0.861, P < .001) and achieved higher sensitivity at 90% and 95% specificity without retraining. For the small-lesion reader study, SCOPE achieved lesion-versus-normal AUC of 0.922 and performed within the range of subspecialty abdominal radiologists; SCOPE provided the correct diagnosis in 14.5% (29/200) of cases in which two or more readers were incorrect.
Conclusion: SCOPE improves early detection of small, potentially curable pancreatic lesions on CT by combining local segmentation and global pancreatic context. Its consistent performance across internal, external, and reader datasets supports potential use as a concurrent reader for earlier and more accurate pancreatic lesion detection.
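The operating point reported above ("sensitivity at 95% specificity") fixes a score threshold on the negative cases and reads off sensitivity on the positives. A minimal sketch of that computation; the function name and the synthetic scores are illustrative, not from the paper:

```python
import numpy as np

def sensitivity_at_specificity(y_true, scores, specificity=0.95):
    """Pick the threshold that yields the target specificity on negative
    cases, then report sensitivity on positive cases at that threshold."""
    negatives = scores[y_true == 0]
    # The (specificity)-quantile of negative scores: that fraction of
    # negatives falls at or below this threshold, i.e. is correctly ruled out.
    threshold = np.quantile(negatives, specificity)
    return float(np.mean(scores[y_true == 1] > threshold))

# Synthetic example: 100 negatives with spread-out scores, 10 positives.
y = np.array([0] * 100 + [1] * 10)
s = np.concatenate([np.arange(100) / 100, np.array([0.99] * 8 + [0.5] * 2)])
sens = sensitivity_at_specificity(y, s, 0.95)
```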

2
Development and validation of a deep learning model for the automated detection of vertebral artery calcification on non-contrast head-and-neck computed tomography

Ueda, Y.; Okazaki, T.; Isome, H.; Patel, A.; Ichimasa, T.; Asaumi, R.; Kawai, T.; Suyama, K.; Hayashi, S.

2026-03-17 radiology and imaging 10.64898/2026.03.15.26348421 medRxiv
Top 0.1%
22.7%

Background: Vertebral artery calcification (VAC), a critical indicator of cerebrovascular disease, is often overlooked in head-and-neck imaging. Manual detection is time-consuming and prone to inter-observer variability. This study aimed to develop and validate a deep learning model for automated detection and quantitative risk assessment of VAC in non-contrast head-and-neck computed tomography (CT) images, bridging the diagnostic gap between dentistry and vascular medicine. Methods: We developed a deep learning model based on the ResNet-18 architecture, designated Grayscale ResNet, optimized for single-channel CT images. The development followed a two-phase strategy: initial training on 539 axial images from head-and-neck CT, followed by iterative refinement (fine-tuning) using a targeted dataset of clinically significant cases to ensure generalizability. The model's performance was evaluated using patient-level receiver operating characteristic (ROC) analysis and saliency map visualization for clinical interpretability. Results: The optimized model demonstrated robust performance in distinguishing between cases with and without VAC. In the independent cohort, the model achieved an area under the curve (AUC) of 0.846. At a specific threshold value (98.6%), the system yielded a sensitivity of 80.0% and a specificity of 90.6%. Saliency map analysis confirmed that the model consistently focused on anatomically relevant vascular regions. Conclusions: The proposed automated system provides an accurate and reliable method for VAC screening using routine head-and-neck CT scans. By transforming incidental imaging findings into a quantifiable risk index, this tool can serve as a vital decision-support system for dentists and radiologists, facilitating early patient referrals and contributing to global stroke prevention.

3
AI-Based Clinical Decision Support Systems for Secondary Caries on Bitewings: A Multi-Algorithm Comparison

Chaves, E. T.; Teunis, J. T.; Digmayer Romero, V. H.; van Nistelrooij, N.; Vinayahalingam, S.; Sezen-Hulsmans, D.; Mendes, F. M.; Huysmans, M.-C.; Cenci, M. S.; Lima, G. d. S.

2026-04-25 dentistry and oral medicine 10.64898/2026.04.17.26350883 medRxiv
Top 0.1%
19.8%

Background: Radiographic detection of caries lesions adjacent to restorations is challenging due to limitations of two-dimensional imaging and difficulties distinguishing true lesions from restorative or anatomical radiolucencies. Artificial intelligence (AI)-based clinical decision support systems (CDSSs) have been introduced to assist radiographic interpretation; however, different AI tools may yield variable diagnostic outputs, and their comparative performance remains unclear. Objective: To compare the diagnostic performance of commercial and experimental AI algorithms for detecting secondary caries lesions on bitewings. Methods: This cross-sectional diagnostic accuracy study included 200 anonymized bitewings comprising 885 restored tooth surfaces. A consensus group reference standard identified all surfaces with a caries lesion and classified each lesion by type (primary/secondary) and depth (enamel-only/dentin-involved). Five commercial (Second Opinion, CranioCatch, Diagnocat, DIO Inteligencia, and Align X-ray Insights) and three experimental (Mask R-CNN-based and Mask DINO-based) systems were tested. Diagnostic performance was expressed through sensitivity, specificity, and overall accuracy (95% CI). Comparisons used generalized estimating equations, adjusted for clustered data. Results: Specificity was high across all systems (0.957-0.986), confirming accurate recognition of non-carious surfaces, whereas sensitivity was moderate (0.327-0.487), reflecting frequent missed detections of enamel and dentin lesions. Accuracy ranged from 0.882 to 0.917, with no significant differences among models (p >= 0.05). Confounding factors, such as radiographic overlapping, marginal restoration defects, and cervical artifacts, were the main sources of misclassification. Conclusions: AI algorithms, regardless of architecture or commercial status, showed similar diagnostic capabilities and a conservative detection profile, favoring specificity over sensitivity. 
Improvements in dataset diversity, labeling precision, and explainability may further enhance reliability for secondary caries detection. Clinical Significance: AI-based CDSSs assist clinicians by providing consistent detection. Their high specificity is particularly valuable in minimizing unnecessary invasive treatments (overtreatment), though they should be used as adjuncts rather than a replacement for expert judgment.

4
Deep Neural Patchworks Predict Renal Imaging Biomarkers from Non-Contrast MRI via Knowledge Transfer from Arterial-Phase Contrast-Enhanced MRI

Kästingschäfer, K. F.; Fink, A.; Rau, S.; Reisert, M.; Kellner, E.; Nolde, J. M.; Kottgen, A.; Sekula, P.; Bamberg, F.; Russe, M. F.

2026-02-26 radiology and imaging 10.64898/2026.02.24.26346961 medRxiv
Top 0.1%
18.7%

Rationale and Objectives: Contrast-enhanced (CE) MRI provides clear corticomedullary contrast for renal compartment delineation but may be contraindicated or undesirable in routine practice. We aimed to enable automated extraction of renal imaging biomarkers from routine non-contrast-enhanced (NCE) T1-weighted MRI by transferring CE-derived compartment labels. Materials and Methods: This retrospective single-center study (January 2017 to December 2021) included 200 participants with paired arterial-phase CE and NCE T1-weighted MRI. Cortex, medulla, and sinus were manually segmented on CE MRI and rigidly transferred to NCE MRI to provide voxel-level reference labels. A hierarchical 3D Deep Neural Patchworks model was trained on 100 examinations (90 training/10 validation) and evaluated on an independent test set of 100 examinations using the transferred CE masks on NCE as reference. Performance was assessed using Dice similarity of segmentations and biomarker agreement using volumes and surface areas (Pearson/Spearman, MAE, Lin's CCC, and Bland-Altman). Results: Whole-kidney segmentation Dice was 0.950 (left) and 0.953 (right). Total kidney volume showed high agreement with minimal bias (MAE 8.76 mL, 2.5% of mean; CCC 0.983; bias -1.56 mL; 95% limits of agreement -28.81 to 25.69 mL). Cortex volume was modestly overestimated and medulla volume underestimated, shifting predicted compartment fractions toward cortex (74.7% vs. 72.1% in ground truth; medulla 21.5% vs. 24.3%; sinus 3.8% vs. 3.6%). Sinus volume maintained high concordance despite higher Dice dispersion. Surface area was systematically underestimated with low concordance. Conclusion: CE-supervised knowledge transfer enables accurate, well-calibrated kidney volumetry from routine NCE MRI and supports contrast-free renal biomarker extraction. Surface area estimation remains challenging.
Take-home Messages: (1) CE-supervised label transfer enables accurate, well-calibrated contrast-free kidney volumetry on routine non-contrast T1-weighted MRI. (2) Compartment volumetry is feasible but shows systematic cortex overestimation and medulla underestimation; surface area remains non-interchangeable due to boundary uncertainty.
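The bias and 95% limits of agreement quoted for total kidney volume come from the standard Bland-Altman computation on paired differences. A minimal sketch with illustrative paired volumes (not data from the study):

```python
import statistics

def bland_altman(method_a, method_b):
    """Bias (mean of paired differences) and 95% limits of agreement
    (bias +/- 1.96 * SD of the differences) between two methods."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Illustrative paired kidney volumes in mL (predicted vs. reference):
predicted = [152.0, 148.5, 160.2, 141.8]
reference = [150.0, 150.0, 158.0, 144.0]
bias, (lo, hi) = bland_altman(predicted, reference)
```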

5
Impact of Image Bit Depth Reduction on Deep Learning Performance in Chest Radiograph Analysis: A Multi-institutional Study

Takita, H.; Mitsuyama, Y.; Walston, S. L.; Saito, K.; Sugibayashi, T.; Okamoto, M.; Suh, C. H.; Ueda, D.

2026-03-09 radiology and imaging 10.64898/2026.03.07.26347853 medRxiv
Top 0.1%
17.3%

Purpose: Medical imaging typically generates 12- to 16-bit formats, yet conversion to 8-bit is often required. While deep learning has been widely explored in medical imaging, the influence of image bit depth on model performance is not fully understood. This study evaluates the impact of conversion from 16-bit to 8-bit for sex, age, and obesity classification using deep learning. Materials and Methods: In this retrospective, multi-institutional study, we analyzed 100,002 chest radiographs from 48,047 participants across three institutions. Three convolutional neural network architectures (ResNet52, EfficientNetB2, and ConvNeXtSmall) were trained on both 16-bit and 8-bit versions of the images. Model performance was evaluated using internal test datasets, randomly split multiple times, and an external test dataset. Statistical analysis included paired comparisons of area under the receiver operating characteristic curve (AUC-ROC) values, with Bonferroni correction for multiple comparisons. Results: Across all architectures and classification tasks, differences between 16-bit and 8-bit model performance were minimal (mean differences ranging from -0.218% to 0.184%). Statistical analyses revealed no significant differences in AUC-ROC values between bit depths for any model-task combination (all p-values > 0.05 after Bonferroni correction). Effect sizes were small to moderate (Cohen's d ranging from -0.415 to 0.391). Conclusion: Reducing image bit depth from 16-bit to 8-bit does not significantly impact the performance of deep learning models in chest radiograph analysis. These findings suggest that 8-bit images can be used for deep learning applications in medical imaging without compromising model performance, potentially allowing for more efficient data storage and processing.
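The abstract does not state which rescaling convention was used for the 16-bit to 8-bit conversion; a common choice is linear min-max rescaling, sketched below (windowed rescaling is another frequent option):

```python
import numpy as np

def to_8bit(img16):
    """Linearly rescale a 16-bit image to 8-bit by mapping the image's
    own min/max to [0, 255]. One simple convention, not necessarily the
    one used in the study."""
    img = img16.astype(np.float64)
    lo, hi = img.min(), img.max()
    scaled = (img - lo) / max(hi - lo, 1e-12)  # guard against flat images
    return np.round(scaled * 255).astype(np.uint8)

# Tiny synthetic 16-bit "radiograph": corners map to 0 and 255.
cxr = np.array([[0, 4095], [16383, 65535]], dtype=np.uint16)
out = to_8bit(cxr)
```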

6
Accessible and Reproducible Renal Cell Carcinoma Research Through Open-Sourcing Data and Annotations

de Boer, S.; Häntze, H.; Ziegelmayer, S.; van Ginneken, B.; Prokop, M.; Bressem, K. K.; Hering, A.

2026-04-23 radiology and imaging 10.64898/2026.04.22.26351451 medRxiv
Top 0.1%
15.2%

Background: Medical imaging, especially computed tomography and magnetic resonance imaging, is essential in the clinical care of patients with renal cell carcinoma (RCC). Artificial intelligence (AI) research into computer-aided diagnosis, staging, and treatment planning needs curated and annotated datasets. Across the literature, The Cancer Genome Atlas (TCGA) datasets are widely used for model training and validation. However, re-annotation is often necessary due to limited access to public annotations, raising entry barriers and hindering comparison with prior work. Methods: We screened 1915 CT scans from three TCGA-RCC databases and employed a segmentation model to annotate kidney lesions. After a metadata-based exclusion step, we hosted a reader study with all papillary (n=56), chromophobe (n=27) and 200 randomly selected clear cell RCC cases. Two students quality-checked and corrected the data and annotated tumors and cysts. Uncertain cases were checked by a board-certified radiologist. Results: After data exclusion and quality control, a total of 142 annotated CT scans from 101 patients (26 female, 75 male, mean age 56 years) remained. This includes 95 CTs with clear cell RCC, 29 with papillary RCC, and 18 with chromophobe RCC. Images and voxel-level annotations of kidneys and lesions are open-sourced at https://zenodo.org/records/19630298. Conclusion: By making the annotations open-source, we encourage accessible and reproducible AI research for renal cell carcinoma. We invite other researchers who have previously annotated any of these cohorts to share their annotations.

7
Usages and perceptions of artificial intelligence among French radiologists

Jean, A.; Benillouche, P.; Jacques, T.

2026-03-26 radiology and imaging 10.64898/2026.03.23.26348621 medRxiv
Top 0.1%
14.4%

This study analyzes the adoption, barriers, and expectations of French radiologists regarding the use of Artificial Intelligence (AI) solutions in their daily practice. Despite recognition of AI's potential to make radiology more precise, predictive, and personalized, its adoption remains limited. The main obstacles identified are the high cost of these solutions and the limited availability of AI technologies in French imaging centers. Nevertheless, the survey reveals a strong willingness to adopt, with over 70% of radiologists expressing a desire to use AI and 0% declaring a refusal to use it. Furthermore, radiologists' fears of being replaced by AI are very low (0 to 8.8%).

8
Clinical validation of automated and multiple manual callosal angle measurement methods in idiopathic normal pressure hydrocephalus

Seo, W.; Jabur Agerberg, S.; Rashid, A.; Holmstrand, N.; Nyholm, D.; Virhammar, J.; Fallmar, D.

2026-02-14 radiology and imaging 10.64898/2026.02.12.26346185 medRxiv
Top 0.1%
14.3%

Introduction: Idiopathic normal pressure hydrocephalus (iNPH) is a partially reversible neurological disorder in which imaging biomarkers support diagnosis and surgical decision-making. The callosal angle (CA) is one of the most robust radiological markers of iNPH and has also been associated with postoperative shunt outcome. However, several manual measurement variants exist, and artificial intelligence (AI)-based tools now enable automatic CA measurement. Materials and Methods: In total, 71 patients (40 with confirmed iNPH and 31 controls) were included. Six predefined manual methods for measuring CA were applied to preoperative 3D T1-weighted MRI and evaluated for diagnostic performance and interobserver agreement. An AI-derived automatic CA (cMRI from Combinostics) was included as a seventh method and compared with the traditional manual method (perpendicular to the bicommissural plane and through the posterior commissure). Automatic measurements were additionally assessed in pre- and postoperative scans to evaluate robustness against shunt-related artifacts. Results: All seven CA variants significantly differentiated iNPH patients from controls (p < 0.05). The traditional method showed the highest discriminative performance (AUC = 0.986, SE = 0.012), while alternative planes demonstrated slightly lower accuracy (AUC range = 0.957-0.978). Interobserver agreement for manual measurements was good to excellent (ICC = 0.687-0.977). Automatic CA measurements showed excellent correlation with the traditional method (preoperative ICC = 0.92; postoperative ICC = 0.96). Conclusion: Although several CA positions perform comparably, the traditional method remains marginally superior and is best supported by the literature. Automated CA measurements closely match expert manual assessment in pre- and postoperative imaging, supporting clinical implementation.

9
Automated Detection of Dental Caries and Bone Loss on Periapical and Bitewing Radiographs using a YOLO Based Deep Learning Model

Alqaderi, H.; Kapadia, U.; Brahmbhatt, Y.; Papathanasiou, A.; Rodgers, D.; Arsenault, P.; Cardarelli, J.; Zavras, A.; Li, H.

2026-04-17 dentistry and oral medicine 10.64898/2026.04.12.26350726 medRxiv
Top 0.1%
13.5%

Background: Dental caries and periodontal disease represent the most prevalent global oral health conditions, collectively affecting several billion people. The diagnostic interpretation of dental radiographs, a cornerstone of modern dentistry, is associated with considerable inter-observer variability. In routine clinical practice, clinicians are required to evaluate a high volume of radiographic images daily, a cognitively demanding task in which diagnostic fatigue, time constraints, and the inherent complexity of overlapping anatomical structures can lead to the inadvertent oversight of early-stage pathologies. Artificial intelligence (AI) offers a transformative opportunity to augment clinical decision-making by providing rapid, objective, and consistent radiographic analysis, thereby serving as a tireless adjunct capable of flagging findings that may be missed during routine human inspection. Methods: This study developed and validated a deep learning system for the automated detection of dental caries and alveolar bone loss using a dataset of 1,063 periapical and bitewing radiographs. Two separate YOLOv8s object detection models were trained and evaluated using a rigorous 5-fold cross-validation methodology. To align with the clinical use-case of a screening tool, where high sensitivity is paramount, a custom image-level evaluation criterion was employed: a true positive was recorded if any predicted bounding box had a Jaccard Index (IoU) > 0 with any ground truth annotation. Model performance was systematically evaluated at confidence thresholds of 0.10 and 0.05. Results: At a confidence threshold of 0.05, the caries detection model achieved a mean precision of 84.41% (±0.72%), recall of 85.97% (±4.72%), and an F1-score of 85.13% (±2.61%). The alveolar bone loss model demonstrated exceptionally high performance, with a mean precision of 95.47% (±0.94%), recall of 98.60% (±0.49%), and an F1-score of 97.00% (±0.46%).
Conclusion: The YOLOv8-based models demonstrated high accuracy and high sensitivity for detecting dental caries and alveolar bone loss on periapical radiographs. The system shows significant potential as a reliable automated assistant for dental practitioners, helping to improve diagnostic consistency, reduce the risk of missed pathology, and ultimately enhance the standard of patient care.
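The image-level criterion described above (any prediction with IoU > 0 against any ground-truth box counts the image as a true positive) is easy to make concrete. A minimal sketch with boxes as (x1, y1, x2, y2) tuples; the helper names are illustrative, not from the paper:

```python
def iou(a, b):
    """Jaccard index of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def image_true_positive(predictions, ground_truths):
    """Image-level TP under the lenient criterion: any predicted box
    overlaps (IoU > 0) any ground-truth box."""
    return any(iou(p, t) > 0 for p in predictions for t in ground_truths)
```

Note the design trade-off this encodes: even a marginal overlap counts as a hit, which boosts sensitivity at the cost of localization strictness.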

10
AI-based radiomics for pancreatic cysts: high diagnostic performance amid a persistent translational gap

Lettner, J. D.; Evrenoglou, T.; Binder, H.; Fichtner-Feigl, S.; Neubauer, C.; Ruess, D. A.

2026-02-12 radiology and imaging 10.64898/2026.02.10.26345995 medRxiv
Top 0.1%
10.6%

Background: AI-based radiomics has demonstrated promising diagnostic performance for pancreatic cystic neoplasms, yet clinical translation remains limited. Whether this reflects insufficient model performance or structural limitations of the evidence base remains unclear. Methods: We performed a systematic review and diagnostic test accuracy meta-analysis of AI-based radiomics in pancreatic cysts (2015-2025), addressing two clinically relevant tasks (Q1: cyst type differentiation; Q2: malignancy or high-grade dysplasia prediction). Training and validation datasets were synthesized independently using hierarchical models. Study evaluation extended beyond diagnostic performance to a four-dimensional framework integrating RQS 2.0, METRICS, TRIPOD+AI, and PROBAST+AI, explicitly contrasting pooled diagnostic performance with reporting quality, methodological rigor, and risk of bias. The review was pre-registered (PROSPERO) and conducted according to PRISMA 2020. Results: Twenty-nine studies were included (Q1: n = 15; Q2: n = 14), predominantly retrospective and single center. Training-based analyses showed high apparent diagnostic performance for Q1 (pooled sensitivity/specificity: 0.89 [95% CI, 0.85-0.92]/0.90 [0.85-0.93]), but there was substantial heterogeneity (τ² = 0.56/0.78; ρ = 0.38). Validation-based performance remained high (0.86 [0.82-0.89]/0.88 [0.81-0.93]), while heterogeneity persisted and prediction regions exceeded confidence regions. Training-based analyses demonstrated similarly high apparent performance (0.88 [0.79-0.95]/0.89 [0.81-0.94]) for Q2, with pronounced heterogeneity (τ² = 1.98/1.61; ρ = 0.63). Validation-based performance was slightly lower, yet still clinically comparable (0.82 [0.75-0.89]/0.86 [0.80-0.91]), and heterogeneity persisted (τ² = 0.71/0.43; ρ = 0.15).
Across both tasks, high diagnostic accuracy occurred alongside incomplete reporting, limited validation, and an elevated risk of bias. Conclusion: AI-based radiomics for pancreatic cysts has reached a structural performance plateau. Further improvements in diagnostic accuracy alone are insufficient to achieve clinical translation and must be accompanied by a paradigm shift from performance-driven model development toward decision-anchored study designs, robust validation strategies, transparent reporting standards, and clinically integrated evaluation frameworks. Summary: Although pancreatic cystic lesions are increasingly being detected, imaging-based decision-making remains limited, particularly regarding differentiating between cyst types and stratifying malignancy risk. In this PRISMA-compliant and PROSPERO-registered systematic review and meta-analysis of diagnostic tests, we evaluated the use of AI-based radiomics for these two tasks, as well as its contextualized performance. In addition, a four-dimensional framework was employed to conduct the evaluation, incorporating diagnostic accuracy, reporting quality, risk of bias, and radiomics maturity. Across studies published between 2015 and 2025, the pooled diagnostic performance was consistently high, with only modest declines observed from the training to the validation stage. Nevertheless, considerable heterogeneity between studies and limited transportability remained evident. Multidimensional evaluation indicated a systematic dissociation between reported performance and methodological robustness, characterized by incomplete reporting, restricted validation, and an elevated risk of bias. These limitations were consistent across both clinical questions and were not resolved by increasing model complexity. The findings of this meta-analysis suggest that the structural performance of AI-based radiomics for pancreatic cysts has plateaued.
To progress towards clinical translation, study designs anchored in decision-making processes, robust multi-center validation, and transparent, reproducible evaluation frameworks are needed; these are preferable to further optimization of model architecture alone.

11
A Retrospective Multi-Source Clinical Validation of Lenek Intelligent Radiology Assistant: An Artificial Intelligence-Based Chest Radiograph Screening and Triage System for High-Burden Pulmonary and Cardiac Conditions in India

Singh, V.; Jhamb, A.; Sil, S.; Kumar, S.; Agrawal, C.; Pareek, A.; Gautam, A.; Parale, G.; Singh, S.; Padmanabhan, D.

2026-03-16 radiology and imaging 10.64898/2026.03.14.26348373 medRxiv
Top 0.1%
9.3%

Background: A critical radiologist shortage exists in India, leading to delayed chest radiograph (CXR) interpretation. This leads to disease progression, higher morbidity, and mortality. Artificial intelligence-based CXR interpretation by the Lenek Intelligent Radiology Assistant (LIRA) is a promising solution. This study aims to establish the screening and triaging capabilities of LIRA by assessing its accuracy in detecting abnormalities and pathologies in CXRs from geographically diverse institutions. Methods: We conducted a retrospective multi-source validation of the diagnostic accuracy of LIRA for the detection of general abnormalities, tuberculosis, consolidation, pleural effusion, pneumothorax, and cardiomegaly. De-identified chest radiographs were input into LIRA models. The obtained interpretations were compared to the established ground-truth reporting for the calculation of sensitivity, specificity, and AUROC with 95% CI for individual pathologies across varying probability thresholds. Results: LIRA demonstrated high sensitivity for general abnormality detection (AUROC 0.93-0.986, 84.4-97.1% sensitivity, 88.9-92.4% specificity) and tuberculosis triaging (Shenzhen & Montgomery: 88.5-89.7% sensitivity, 89.9-90.5% specificity; Jaypee: 98.7% sensitivity, 63.6% specificity). For consolidation (AUROC 0.884-0.895, 96.4-96.9% sensitivity, 70.8-77.1% specificity), pleural effusion (AUROC 0.942-0.967, 79.7-99.1% sensitivity, 81.2-87.7% specificity), pneumothorax (AUROC 0.87, 90.6-94.8% sensitivity, 79.5-82.7% specificity), and cardiomegaly (AUROC 0.883, 95.1% sensitivity, 81.6% specificity), the model exhibited commendable accuracy as well. Conclusions: The diagnostic performance of LIRA was consistent across various pathologies and chest radiographs from diverse geographic locations, with particular strengths in abnormality detection and tuberculosis screening.
The risk-stratified triaging and high sensitivity of LIRA make it a reliable adjunct solution to address radiologist shortages, reduce turnaround times, and support India's tuberculosis elimination goals.

12
Image Quality Evaluation of Neonatal Brain MRI Using a Deep Learning Reconstruction Algorithm: A Quantitative and Multireader Study Using Variable Denoising Levels at 3 Tesla

Alvi, Z.; Reis, E. P.; Shin, D. D.; Banerjee, S.; Dahmoush, H. M.; Campion, A.; Esmeraldo, M. A.; Chambers, S.; Kravutske, Y.; Gatidis, S.; Soares, B. P.

2026-02-09 radiology and imaging 10.64898/2026.02.04.26345479 medRxiv
Top 0.1%
8.6%

Purpose: Neonatal imaging is particularly challenging because newborns have a high likelihood of head motion, which can degrade image quality and complicate interpretation. Improving MRI brain image quality may help reduce diagnostic uncertainty and facilitate the nuanced assessment of early myelinating structures in the neonatal brain. Although deep learning reconstruction algorithms designed to improve MRI image quality have been evaluated in pediatric imaging, they have not been specifically studied in exclusively neonatal populations. We sought to evaluate image quality improvement using a deep learning reconstruction algorithm in neonatal brain imaging. Methods: 3D T1-weighted brain MRIs were obtained in 15 neonates. A deep learning reconstruction algorithm was applied to the image sets using low, medium, and high levels of denoising. Three radiologists qualitatively rated image quality (signal-to-noise ratio, presence of artifacts, and overall clarity) of eight early myelinating structures on a 4-point scale. Objective apparent signal-to-noise ratio (aSNR) and apparent contrast-to-noise ratio (aCNR), based on signal intensities of white and gray matter, were measured across all three denoising levels. Results: Evaluation by radiologists indicated an overall increase in all image quality categories and increased conspicuity of the early myelinating structures as the level of denoising increased. Objective aSNR and aCNR values also increased progressively with denoising, with significant differences observed for nearly all pairwise comparisons. Conclusion: Our findings suggest that the use of the proposed deep learning reconstruction algorithm improves image quality in 3D T1-weighted neonatal brain MRIs at 3T.

13
Nationwide Organ Volume Reference Standards and Aging-Related Changes in Abdominal CT from Japan

Kikuchi, T.; Yamamoto, K.; Yamagishi, Y.; Akashi, T.; Hanaoka, S.; Yoshikawa, T.; Fujii, H.; Mori, H.; Makimoto, H.; Kohro, T.

2026-02-03 radiology and imaging 10.64898/2026.01.30.26345246 medRxiv
Top 0.1%
8.5%

Background: Large-scale CT-based reference standards for abdominal organ volume, incorporating age, sex, and body size, are limited. Purpose: To establish sex- and age-specific reference distributions for major abdominal organ volumes on non-contrast abdominopelvic CT in a nationwide Japanese cohort to provide a foundation for automated clinical assessment and dose optimization. Materials and Methods: In this retrospective, multicenter study, using the Japan Medical Image Database, we identified all non-contrast abdominopelvic CT examinations performed in 2024. Unique adults with available data on age, sex, height, and weight were included. The final sample comprised 49,764 examinations (26,456 men and 23,308 women) conducted at nine institutions. Automated segmentation (TotalSegmentator v2.10.0) was used to produce organ volumes, excluding hollow viscera. The sex-specific 10th, 25th, 50th, 75th, and 90th percentiles were calculated. Age-volume relationships of body surface area (BSA)-normalized volumes (mL/m2) were modeled using natural cubic splines (four degrees of freedom) separately by sex. Results: Median (mL) male vs female volumes were as follows: liver, 1194.7 vs 1024.0; pancreas, 63.6 vs 52.2; spleen, 118.1 vs 95.1; kidneys (total), 268.3 vs 221.2; adrenals (total), 6.6 vs 4.2; iliopsoas (total), 483.4 vs 317.7; prostate, 24.9 (men only). Age-volume relationships of BSA-normalized volumes showed convex patterns for the liver, pancreas, and kidneys in both sexes and for male adrenal glands; lower values in older age groups for the spleen and iliopsoas in both sexes; and higher values in older age groups for the prostate and female adrenal glands. Conclusion: This nationwide Japanese CT cohort provides sex- and age-resolved volumetric reference standards.
These standards enable objective identification of abnormalities, support personalized medicine, and facilitate automated AI-based reporting to reduce radiologist workload and optimize radiation dose protocols. Key Results: (1) Median volumes (men vs women, mL): liver 1195/1024; pancreas 64/52; spleen 118/95; kidneys 268/221; adrenals 6.6/4.2; iliopsoas 483/318; prostate 25. (2) Body surface area-normalized age-volume relationships were convex for liver, pancreas, and kidneys in both sexes and for male adrenal glands. (3) Spleen and iliopsoas declined monotonically with age in both sexes, whereas prostate and female adrenal glands increased monotonically.
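The core computations described above, percentile reference distributions and BSA normalization, are straightforward to sketch. Note the abstract does not state which BSA formula was used; the Du Bois formula below is one common choice and is an assumption here:

```python
import numpy as np

def reference_percentiles(volumes_ml, qs=(10, 25, 50, 75, 90)):
    """Reference percentiles (e.g. per sex) for one organ's volumes."""
    return dict(zip(qs, np.percentile(volumes_ml, qs)))

def bsa_dubois(height_cm, weight_kg):
    """Du Bois body surface area in m^2. An assumed formula; the paper
    does not specify its BSA method."""
    return 0.007184 * height_cm ** 0.725 * weight_kg ** 0.425

# BSA-normalized volume (mL/m^2), as used for the age-volume modeling:
def normalized_volume(volume_ml, height_cm, weight_kg):
    return volume_ml / bsa_dubois(height_cm, weight_kg)
```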

14
The false positive paradox: Examining real-world clinical predictive performance of FDA-authorized AI devices for radiology using clinical prevalence

Sparnon, E.; Stevens, K.; Song, E.; Harris, R. J.; Strong, B. W.; Bruno, M. A.; Baird, G. L.

2026-03-27 radiology and imaging 10.64898/2026.03.25.26349197 medRxiv
Top 0.1%
8.5%

The present study evaluates the real-world clinical predictive performance of FDA-authorized artificial intelligence (AI) devices used in radiology, focusing on the false positive paradox (FPP) and its implications for clinical practice. To do this, we analyzed publicly available FDA 510(k) summary data on AI radiology devices from 2024 and 2025, demonstrating how diagnostic accuracy metrics like sensitivity and specificity do not necessarily translate into high positive predictive value (PPV) due to the influence of target disease prevalence. We show the importance of disclosing the false discovery rate (FDR) and false omission rate (FOR) and argue that this transparency enables clinicians to select AI systems that balance false positive and false negative costs in a clinically, ethically, and financially appropriate manner. Finally, we provide recommendations for what data should be provided to best serve practices and radiologists.
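The arithmetic behind the false positive paradox is worth making explicit: by Bayes' rule, even a device with 95% sensitivity and 95% specificity yields a low PPV when the target finding is rare. The numbers below are illustrative, not drawn from the 510(k) summaries:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Post-test probabilities at a given prevalence (Bayes' rule)."""
    tp = sensitivity * prevalence               # true-positive mass
    fp = (1 - specificity) * (1 - prevalence)   # false-positive mass
    fn = (1 - sensitivity) * prevalence         # false-negative mass
    tn = specificity * (1 - prevalence)         # true-negative mass
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, 1 - ppv, 1 - npv                # PPV, FDR, FOR

# A "95%/95%" device screening for a finding with 1% prevalence:
ppv, fdr, fom = predictive_values(0.95, 0.95, 0.01)
# ppv ~ 0.16, fdr ~ 0.84: roughly five of every six flags are false positives.
```

The FOR stays tiny at the same prevalence, which is precisely the asymmetry the authors argue should be disclosed alongside sensitivity and specificity.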

15
High-resolution disconnectome predicts outcome and response to thrombectomy in basilar artery occlusion

Authamayou, B.; Marnat, G.; Matsulevits, A.; Munsch, F.; Lavielle, A.; Courbin, N.; Foulon, C.; Chen, B.; Micard, E.; Gory, B.; L'Allinec, V.; Bourcier, R.; Naggara, O.; Lauze, E.; Boulouis, G.; Lapergue, B.; Eker, O.; Sibon, I. P.; Thiebaut de Schotten, M.; Tourdias, T.

2026-04-21 radiology and imaging 10.64898/2026.04.20.26350998 medRxiv
Top 0.1%
6.8%

Background: Acute basilar artery occlusion (BAO) causes devastating strokes. Despite the benefit of endovascular treatment, optimal management sometimes remains controversial, as in patients with mild deficits, and would benefit from robust prognostic tools. Given the dense white matter networks within the posterior fossa, we tested whether quantifying disconnections from acute diffusion-weighted imaging (DWI) could improve prediction of outcome and of response to recanalization compared with conventional metrics. Methods: We conducted a secondary analysis of a prospective multicenter stroke registry, including consecutive patients (2017-2024) with BAO and admission MRI. Ultra-high-resolution diffusion MRI was acquired in healthy participants to build normative tractograms with optimized posterior fossa quality. Patient infarcts delineated on DWI were projected onto these tractograms to estimate disconnected fiber volume. The primary outcome was 90-day modified Rankin Scale (mRS) 0-3 vs 4-6. Predictive performance of disconnected fiber volume was compared with baseline NIHSS, infarct volume, and posterior circulation ASPECTS (pc-ASPECTS) using logistic regression and areas under receiver operating characteristic curves (AUC). Ordinal regressions tested associations across the full mRS spectrum, stratified by recanalization status. Analyses were repeated in patients with NIHSS ≤10. Results: Among 201 patients (median age 70; NIHSS 10), 97 (48.3%) had a poor outcome. Despite a small median infarct volume (4.75 mL), disconnected fiber volume was substantial (median 25.15 mL). Disconnected fiber volume achieved an AUC of 0.84, outperforming NIHSS (0.67; p<0.0001), infarct volume (0.75; p=0.00059), and pc-ASPECTS (0.76; p=0.0127). Low disconnected fiber volume predicted better outcomes across the full mRS (OR=0.12 [95% CI, 0.065-0.204]) and greater benefit from successful recanalization (OR=0.33 [95% CI, 0.15-0.70]).
In patients with NIHSS ≤10 (n=102), disconnected fiber volume remained the strongest predictor (AUC=0.83). Conclusions: Disconnected fiber volume, derived indirectly from normative tractograms, is a robust prognostic marker of BAO outcomes that outperforms conventional predictors and may support future treatment decisions. Registration: https://clinicaltrials.gov - NCT03776877.
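The AUC comparisons above rest on the standard rank interpretation of AUC: the probability that a randomly chosen poor-outcome patient has a higher predictor value than a randomly chosen good-outcome patient. A minimal sketch with toy scores (not study data):

```python
def auc_mann_whitney(scores_pos, scores_neg):
    """AUC as the Mann-Whitney probability that a positive case
    outscores a negative one; ties count as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(scores_pos) * len(scores_neg))

# Toy predictor values for poor-outcome (positive) and good-outcome cases:
auc = auc_mann_whitney([0.9, 0.8, 0.4], [0.3, 0.5, 0.2])
# 8 of the 9 positive-negative pairs are correctly ordered.
```

Formal comparison of two correlated AUCs, as in the study, additionally needs a variance estimate such as the DeLong method rather than this point estimate alone.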

16
External validation of self-supervised transfer learning for noninvasive molecular subtyping of pediatric low-grade glioma using T2-weighted MRI

Yoo, J. J.; Tak, D.; Namdar, K.; Wagner, M. W.; Liu, A.; Tabori, U.; Hawkins, C.; Ertl-Wagner, B. B.; Kann, B. H.; Khalvati, F.

2026-01-30 radiology and imaging 10.64898/2026.01.27.26344883 medRxiv
Top 0.1%
6.7%

Purpose: To externally evaluate three binary classification models designed to differentiate the molecular subtype of pediatric low-grade glioma (pLGG) among BRAF Fusion, BRAF Mutation, and Wild Type on T2-weighted magnetic resonance imaging using self-supervised transfer learning, which enables effective performance in a low-data setting. Materials and Methods: This retrospective study evaluates pLGG molecular subtyping models, pre-trained on data collected at the Dana-Farber Cancer Institute/Boston Children's Hospital, on two datasets from the Hospital for Sick Children: one of patients identified from the electronic health record between January 2000 and December 2018 (n=336), and another of patients identified between January 2019 and April 2023 (n=87). Both datasets consist of T2-weighted MRI with pLGG and corresponding genetic marker identifications, labelled as BRAF Fusion, BRAF Mutation, or Wild Type, and include manually annotated ground-truth segmentations that were used in the classification pipeline during evaluation. The models were evaluated using the area under the receiver operating characteristic curve (AUC). To obtain per-class probabilities across all three molecular subtypes, we used the output probabilities from each binary model as logits input to a softmax function; these probabilities were used to determine the AUC of the models on each evaluated dataset. Results: The models achieved a macro-average AUC of 0.7671 on the newer dataset from the Hospital for Sick Children but a lower macro-average AUC of 0.6463 on the older dataset. Conclusions: The evaluated pLGG molecular subtyping models have the potential for effective generalization but may require further fine-tuning for consistent performance across varying datasets.
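The fusion step described above (binary model outputs fed to a softmax as logits) can be sketched as follows. The numeric probabilities are hypothetical, and treating a probability as a logit is exactly the heuristic the abstract describes, not a calibrated posterior:

```python
import math

def fuse_binary_models(binary_probs):
    """Use each one-vs-rest binary model's output probability as a
    logit and softmax across the three pLGG subtypes."""
    exps = {label: math.exp(p) for label, p in binary_probs.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical outputs of the three binary classifiers for one scan:
per_class = fuse_binary_models(
    {"BRAF Fusion": 0.80, "BRAF Mutation": 0.35, "Wild Type": 0.10})
# The per-class probabilities sum to 1 and preserve the binary models' ranking.
```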

17
MRI-Based Blood Clot Phenotyping: An In Vitro Study

Bechtel, G. N.; Das, A.; Noyer, J.; Bush, A. M.; Hormuth, D.; Yankeelov, T. E.; Castillo, E.; Warach, S.; Fuhg, J.; Tamir, J. I.; Saber, H.; Rausch, M. K.

2026-04-16 bioengineering 10.64898/2026.04.14.718500 medRxiv
Top 0.1%
6.5%

Background and Purpose: Neurointerventional outcomes depend on clot composition and may be influenced by clot contraction. Thus, a priori identification of clot composition and contraction could inform procedural strategies and improve outcomes. The goal of our work is to conduct an in vitro test of whether MRI can reliably predict both clot composition and contractile state. Materials and Methods: To this end, we prepared blood clots spanning clinically observed compositions (0-80% red blood cells (RBCs)) in both contracted and uncontracted states. Contraction was controlled by coagulating blood with or without thrombin. We imaged these clots using quantitative, clinical, and investigational MRI sequences. Using these data, we then determined whether MRI signal intensities, quantitative parameters, and radiomic features capturing intensity and texture patterns can (i) predict clot hematocrit and (ii) classify clots by composition (RBC-rich vs. fibrin-rich) and contraction state. Results: Quantitative MRI parameters (T1, T2, ADC) decreased with increasing hematocrit (R2 = 0.56-0.85, p < 0.001), while signal intensities from clinical sequences showed weaker correlations (R2 = 0.46-0.62, p < 0.001). Radiomic models predicted hematocrit with performance comparable to MRI parameters. When applied to classification, radiomic features accurately discriminated RBC-rich versus fibrin-rich clots, with AUCs exceeding 0.90 across nearly all sequences. In contrast, classification of contraction state showed greater variability in AUCs across sequences but remained high for quantitative T1 and T2 values (AUCs up to 0.88). Trends were consistent across clots coagulated with and without thrombin. Pooling features across sequences did not outperform the best individual sequence for either regression or classification. Conclusions: We demonstrate that MRI-based radiomic analysis quantitatively characterizes clot composition and contraction in vitro.
These findings support the feasibility of using MRI for pre-interventional clot phenotyping, with potential to inform thrombolytic and mechanical thrombectomy strategies. Thus, in vivo studies validating these results are warranted.

18
Pneumonia Detection in Paediatric Chest X-Rays using Ensembled Large Language Models

Tan, J.; Tang, P. H.

2026-04-12 radiology and imaging 10.64898/2026.04.10.26347909 medRxiv
Top 0.1%
6.5%

Background: Paediatric pneumonia is a leading cause of childhood morbidity and mortality worldwide. Chest X-rays (CXR) are an important tool in the diagnosis of pneumonia, but shortages in specialist radiology services lead to clinically significant delays in CXR reporting. The ability to communicate findings to both clinicians and laypersons allows multimodal large language models (MLLMs) to be deployed throughout clinical workflows, from image analysis to patient communication. However, MLLMs currently underperform state-of-the-art deep learning classifiers. Objective: To evaluate the diagnostic accuracy of ensemble strategies with MLLMs compared to the baseline average agent for paediatric radiological pneumonia detection. Methods: We conducted a retrospective cohort study using paediatric CXRs from two independent hospital datasets totalling 2300 CXRs. Fifteen MedGemma-4B-it agents independently classified each CXR into five pneumonia likelihood categories. Majority voting, soft voting, and GPTOSS-20B aggregation were compared against the average agent performance. The primary metric was one-vs-rest (OvR) AUROC. Secondary metrics included accuracy, sensitivity, specificity, F1-score, Cohen's kappa, and one-vs-one (OvO) AUROC. Results: Soft voting achieved improvements in OvR AUROC (p_balanced = 0.0002, p_real-world = 0.0003), accuracy (p_balanced = 0.0008, p_real-world < 0.0001), Cohen's kappa (p_balanced = 0.0006, p_real-world = 0.0054), and OvO AUROC (p_balanced < 0.0001, p_real-world = 0.0011) across both datasets, and a superior F1-score (p_balanced = 0.0028) for the balanced dataset. Conclusion: Soft voting enhances MedGemma's diagnostic discriminatory performance for paediatric radiological pneumonia detection. Our system enables privacy-preserving, near real-time clinical decision support with explainable outputs, with potential for integration into emergency departments. Its high specificity supports triage by flagging high-risk radiological pneumonia cases.
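The two ensembling rules compared above differ in what each agent contributes: a single categorical vote versus a full probability vector. A minimal sketch, with hypothetical category names (the abstract does not name the five likelihood categories):

```python
from collections import Counter

# Hypothetical names for the five pneumonia likelihood categories:
CATEGORIES = ["normal", "unlikely", "indeterminate", "probable", "definite"]

def majority_vote(agent_labels):
    """Each agent casts one categorical vote; the most common label wins."""
    return Counter(agent_labels).most_common(1)[0][0]

def soft_vote(agent_probs):
    """Average the agents' per-category probability vectors, then take
    the argmax of the mean vector."""
    n = len(agent_probs)
    mean = [sum(p[i] for p in agent_probs) / n
            for i in range(len(CATEGORIES))]
    return CATEGORIES[mean.index(max(mean))]

# Three (of fifteen) hypothetical agents scoring one CXR:
probs = [[0.10, 0.20, 0.10, 0.40, 0.20],
         [0.05, 0.15, 0.20, 0.35, 0.25],
         [0.20, 0.10, 0.10, 0.30, 0.30]]
```

Soft voting retains each agent's uncertainty, which is one reason it can outperform majority voting when agents are confident to different degrees.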

19
A Deployable Explainable Deep Learning System for Tuberculosis Detection from Chest X-Rays in Resource-Constrained High-Burden Settings

Agumba, J.; Erick, S.; Pembere, A.; Nyongesa, J.

2026-04-01 radiology and imaging 10.64898/2026.03.31.26349662 medRxiv
Top 0.1%
6.4%

Objectives: To develop and evaluate a deployable deep learning system with Gradient-weighted Class Activation Mapping (Grad-CAM) for tuberculosis screening from chest radiographs and to assess its classification performance and explainability across desktop and mobile deployment platforms. Materials and Methods: This study used publicly available chest X-ray datasets containing Normal and Tuberculosis images. A DenseNet121-based transfer learning model was trained using stratified training, validation, and test splits with data augmentation and class weighting. Model performance was evaluated using accuracy, precision, recall, F1 score, receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC). Grad-CAM was used to visualize regions influencing model predictions. The trained model was converted to TensorFlow Lite and deployed in both a Windows desktop application and a Flutter-based mobile application for offline inference and visualization. Results: The model demonstrated strong classification performance on the independent test dataset, with high accuracy and AUC values indicating effective discrimination between Normal and Tuberculosis cases. Grad-CAM visualizations showed that the model focused primarily on anatomically relevant lung regions, particularly the upper and mid-lung fields in Tuberculosis cases. Deployment testing confirmed consistent prediction outputs and Grad-CAM visualizations across both Windows and mobile platforms. Conclusion: The proposed deployable deep learning system with Grad-CAM provides accurate and interpretable tuberculosis screening from chest radiographs and demonstrates feasibility for offline mobile and desktop deployment. This approach has potential as an artificial intelligence-assisted screening and decision support tool in radiology, particularly in resource-limited and remote healthcare settings.
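The Grad-CAM heatmaps mentioned above come from a simple weighting rule: global-average-pool the gradients of the class score over each feature channel, use those pooled values as channel weights on the activation maps, sum, and apply a ReLU. A framework-free numpy sketch of that core step, using random stand-in tensors rather than the authors' DenseNet121 features:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM core step.
    activations, gradients: arrays of shape (C, H, W) holding the last
    conv layer's feature maps and the class-score gradients w.r.t. them."""
    weights = gradients.mean(axis=(1, 2))             # alpha_k: GAP over space
    cam = np.tensordot(weights, activations, axes=1)  # sum_k alpha_k * A_k
    return np.maximum(cam, 0.0)                       # ReLU keeps positive evidence

rng = np.random.default_rng(0)
acts = rng.standard_normal((64, 7, 7))   # stand-in feature maps
grads = rng.standard_normal((64, 7, 7))  # stand-in gradients
heatmap = grad_cam(acts, grads)          # (7, 7); upsampled to image size in practice
```

In a deployed pipeline the heatmap is normalized and overlaid on the radiograph, which is what makes the prediction inspectable by a clinician.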

20
Pre-procedural testing using patient-specific models is associated with high training fidelity and improved procedural efficiency in endovascular aneurysm treatment

Hofmeister, J.; Bernava, G.; Rosi, A.; Brina, O.; Reymond, P.; Muster, M.; Lovblad, K.-O.; Machi, P.

2026-04-24 radiology and imaging 10.64898/2026.04.23.26351592 medRxiv
Top 0.1%
5.2%
Show abstract

Background: Even for experienced operators, endovascular treatment of unruptured intracranial aneurysms involves intraoperative uncertainty that may lead to adjustments in strategy, prolong the procedure, and potentially cause inefficiency and device waste. This study aimed to evaluate whether pre-procedural testing (PPT) of endovascular treatment using patient-specific models was associated with increased operator confidence and perceived clinical utility, including improvements in procedural efficiency and reduced resource waste. Methods: We enrolled a cohort of patients who underwent PPT before endovascular treatment for complex unruptured intracranial aneurysms and compared their outcomes with a control group treated without PPT. The primary outcome was the Training Fidelity Score, a composite of three operator-reported Likert items defined a priori. Secondary outcomes included perceived clinical utility, intraoperative strategy changes, procedural time, radiation exposure, device waste and safety. Results: A total of 85 patients met the inclusion criteria (PPT=40; control=45). The Training Fidelity Score was high across the PPT group (median, 4.33/5). Perceived clinical utility was high and further increased significantly after the procedure. A significant reduction was observed in intraoperative strategy changes, with no changes recorded in the PPT group, compared to 6/45 in the control group (RR 0.09; p=0.027). Reductions in treatment time, radiation exposure and device waste were also noted. Conclusion: PPT using patient-specific models was associated with increased operator confidence, fewer intraoperative strategy changes, improved procedural efficiency, and reduced device waste without compromising safety. These findings support its use in pre-interventional preparation, but require prospective multicenter validation.